OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding - Supplementary Material
Anonymous Author(s)

1 More Examples of Multi-Modal 3D Shape Retrieval
We leverage the metadata from the four datasets to generate the raw texts. Objaverse: We utilize the name associated with each shape to serve as the text. In this way, we generate one or more raw texts for each shape. I am analyzing a 3D dataset with various text descriptions for the 3D models. If a text contains a clear noun (or noun phrase) that could potentially describe a 3D object, please respond with "Y".
- Asia > Japan (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (11 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (0.88)
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Nguyen, Thao, Li, Yang, Golovneva, Olga, Zettlemoyer, Luke, Oh, Sewoong, Schmidt, Ludwig, Li, Xian
Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has relied on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at the 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage points respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches to generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesis and knowledge extraction. These results suggest that recycling web texts holds potential as a simple and effective approach for scaling pre-training data. We make our high-quality synthetic data publicly available at https://huggingface.co/datasets/facebook/recycling_the_web.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Virginia (0.04)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
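The REWIRE abstract above describes a recycling pipeline: score each document, keep the high-quality ones, and rewrite rather than discard the rest. A minimal sketch of that flow, where `quality_score` and `rewrite` are hypothetical stand-ins (the paper uses a learned quality filter and a guided-rewrite LLM, not these heuristics):

```python
def quality_score(doc: str) -> float:
    """Stub quality heuristic: fraction of alphabetic/whitespace
    characters. REWIRE uses a learned filtering classifier instead."""
    if not doc:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in doc) / len(doc)

def rewrite(doc: str) -> str:
    """Stub for guided rewriting; the paper prompts an LLM to produce
    an enriched version of the low-quality document. Here we only
    normalize whitespace as a placeholder."""
    return " ".join(doc.split())

def build_pretraining_mix(docs, threshold=0.8):
    """Keep high-quality docs as-is; recycle the rest via rewriting,
    instead of dropping them as a conventional filter would."""
    kept, recycled = [], []
    for doc in docs:
        if quality_score(doc) >= threshold:
            kept.append(doc)
        else:
            recycled.append(rewrite(doc))
    return kept + recycled
```

The point of the sketch is the control flow: a document falling below the filter threshold is routed to the rewriter rather than deleted, so the final mix retains (a transformed version of) every input.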
A Comprehensive Analysis on LLM-based Node Classification Algorithms
Wu, Xixi, Shen, Yifei, Ge, Fangzhou, Shan, Caihua, Jiao, Yizhu, Sun, Xiangguo, Cheng, Hong
Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. Subsequently, we conducted extensive experiments, training and evaluating over 2,200 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size) that affect performance. Our findings uncover eight insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field. Codes and datasets are released at \href{https://llmnodebed.github.io/}{https://llmnodebed.github.io/}.
- Oceania > Australia (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Russia (0.04)
- (4 more...)
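Among the paradigms LLMNodeBed benchmarks is zero-shot node classification, where the node's text and its neighborhood are serialized into a prompt. A hedged sketch of such a prompt builder; the field names and layout are illustrative, not the codebase's actual API:

```python
def build_node_prompt(node_text, neighbor_texts, labels):
    """Serialize a target node and its graph neighborhood into a
    zero-shot classification prompt. Format is a plausible example,
    not LLMNodeBed's exact template."""
    lines = [f"Paper: {node_text}", "Cited papers:"]
    lines += [f"- {t}" for t in neighbor_texts]
    lines.append("Classify the paper into one of: " + ", ".join(labels))
    return "\n".join(lines)
```

A prompt like this is what lets a strong general LLM (e.g., GPT-4o in the abstract's comparison) act as a node classifier without any graph-specific training.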
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Cheng, Daixuan, Gu, Yuxian, Huang, Shaohan, Bi, Junyu, Huang, Minlie, Wei, Furu
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
- Asia > Middle East > Jordan (0.04)
- Asia > India > Uttar Pradesh (0.04)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Education (1.00)
- Health & Medicine (0.94)
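Instruction Pre-Training, as described above, augments raw corpora with synthesized instruction-response pairs before pre-training. A minimal sketch of that augmentation step, where `synthesize_pairs` is a trivial placeholder for the paper's fine-tuned instruction synthesizer:

```python
def synthesize_pairs(passage):
    """Stub instruction synthesizer. The paper fine-tunes an efficient
    open-source model to generate diverse pairs; this placeholder
    emits one trivial pair per passage."""
    first_sentence = passage.split(".")[0].strip()
    return [("What does the passage state first?", first_sentence + ".")]

def augment_for_pretraining(passage):
    """Append the synthesized instruction-response pairs to the raw
    text, yielding the augmented pre-training example."""
    parts = [passage]
    for instruction, response in synthesize_pairs(passage):
        parts.append(f"Instruction: {instruction}\nResponse: {response}")
    return "\n\n".join(parts)
```

Scaled up (the paper reports 200M synthesized pairs over 40+ task categories), every raw document becomes a small supervised multitask example while keeping its original text.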
Intruding with Words: Towards Understanding Graph Injection Attacks at the Text Level
Lei, Runlin, Hu, Yuwei, Ren, Yuchen, Wei, Zhewei
Graph Neural Networks (GNNs) excel across various applications but remain vulnerable to adversarial attacks, particularly Graph Injection Attacks (GIAs), which inject malicious nodes into the original graph and pose realistic threats. Text-attributed graphs (TAGs), where nodes are associated with textual features, are crucial due to their prevalence in real-world applications and are commonly used to evaluate these vulnerabilities. However, existing research only focuses on embedding-level GIAs, which inject node embeddings rather than actual textual content, limiting their applicability and simplifying detection. In this paper, we pioneer the exploration of GIAs at the text level, presenting three novel attack designs that inject textual content into the graph. Through theoretical and empirical analysis, we demonstrate that text interpretability, a factor previously overlooked at the embedding level, plays a crucial role in attack strength. Among the designs we investigate, the Word-frequency-based Text-level GIA (WTGIA) is particularly notable for its balance between performance and interpretability. Despite the success of WTGIA, we discover that defenders can easily enhance their defenses with customized text embedding methods or large language model (LLM)-based predictors. These insights underscore the necessity for further research into the potential and practical significance of text-level GIAs.
- Asia > China (0.04)
- Asia > Japan (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (12 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
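The key move in a word-frequency-based text-level attack is realizing an optimized bag-of-words vector as actual node text, so the injection happens at the text level rather than the embedding level. A toy round-trip sketch under that framing (the target vector is assumed to come from an upstream embedding-level attack; the serialization below is illustrative, not WTGIA's actual procedure):

```python
from collections import Counter

def inject_node_text(target_bow):
    """Realize a target bag-of-words vector (word -> count) as plain
    node text. Sorting makes the output deterministic; a real attack
    would also care about fluency/interpretability."""
    words = []
    for word, count in sorted(target_bow.items()):
        words.extend([word] * count)
    return " ".join(words)

def bow(text):
    """Recover the bag-of-words from text to verify the round trip."""
    return Counter(text.split())
```

The round trip (text back to the same word counts) is what guarantees the injected node's frequency-based embedding matches the attacker's target while the node still carries human-readable words.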
UniGraph: Learning a Cross-Domain Graph Foundation Model From Natural Language
Foundation models like ChatGPT and GPT-4 have revolutionized artificial intelligence, exhibiting remarkable abilities to generalize across a wide array of tasks and applications beyond their initial training objectives. However, when this concept is applied to graph learning, a stark contrast emerges. Graph learning has predominantly focused on single-graph models, tailored to specific tasks or datasets, lacking the ability to transfer learned knowledge to different domains. This limitation stems from the inherent complexity and diversity of graph structures, along with the different feature and label spaces specific to graph data. In this paper, we present our UniGraph framework, designed to train a graph foundation model capable of generalizing to unseen graphs and tasks across diverse domains. Unlike single-graph models that use pre-computed node features of varying dimensions as input, our approach leverages Text-Attributed Graphs (TAGs) for unifying node representations. We propose a cascaded architecture of Language Models (LMs) and Graph Neural Networks (GNNs) as backbone networks with a self-supervised training objective based on Masked Graph Modeling (MGM). We introduce graph instruction tuning using Large Language Models (LLMs) to enable zero-shot prediction ability. Our comprehensive experiments across various graph learning tasks and domains demonstrate the model's effectiveness in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer, even surpassing or matching the performance of GNNs that have undergone supervised training on target datasets.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Singapore > Central Region > Singapore (0.04)
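UniGraph's backbone is a cascade: a language model encodes each node's text, then a GNN propagates those features over the graph. A self-contained toy of that two-stage flow, where both stages are deliberate stand-ins (a deterministic hashed bag-of-words for the LM, a single mean-aggregation layer for the GNN):

```python
def text_embed(text, dim=8):
    """Stub text encoder: deterministic hashed bag-of-words.
    UniGraph uses a pre-trained language model here."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec

def gnn_mean_layer(features, edges):
    """One mean-aggregation message-passing step (the GNN half of the
    cascade): each node averages its own feature with its neighbors'."""
    n, dim = len(features), len(features[0])
    neighbors = {i: [i] for i in range(n)}  # self-loop included
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return [[sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
             for d in range(dim)]
            for i in range(n)]
```

Because the first stage consumes raw text rather than pre-computed features of dataset-specific dimension, the same cascade can run on any text-attributed graph, which is the cross-domain property the abstract emphasizes.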
MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering
Chen, Xiusi, Jiang, Jyun-Yu, Chang, Wei-Cheng, Hsieh, Cho-Jui, Yu, Hsiang-Fu, Wang, Wei
Few-shot question answering (QA) aims at achieving satisfactory results on machine question answering when only a few training samples are available. Recent advances mostly rely on the power of pre-trained large language models (LLMs) and fine-tuning in specific settings. Although the pre-training stage has already equipped LLMs with powerful reasoning capabilities, LLMs still need to be fine-tuned to adapt to specific domains to achieve the best results. In this paper, we propose to select the most informative data for fine-tuning, thereby improving the efficiency of the fine-tuning process with comparable or even better accuracy on the open-domain QA task. We present MinPrompt, a minimal data augmentation framework for open-domain QA based on an approximate graph algorithm and unsupervised question generation. We transform the raw text into a graph structure to build connections between different factual sentences, then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We then generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model. Empirical results on several benchmark datasets and theoretical analysis show that MinPrompt is able to achieve comparable or better results than baselines with a high degree of efficiency, bringing improvements in F-1 scores by up to 27.5%.
- North America > United States > California > Los Angeles County > Los Angeles (0.17)
- Europe > Italy > Tuscany > Florence (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Leisure & Entertainment > Sports > Basketball (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
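The "minimal set of sentences needed to cover the most information" in the MinPrompt abstract is a set-cover problem, which is typically attacked with the greedy approximation. A sketch under the simplifying assumption that each sentence has already been mapped to the set of facts (e.g., entities) it mentions; MinPrompt's actual graph construction is richer than this:

```python
def greedy_sentence_cover(sentences):
    """Greedy minimum-set-cover approximation: repeatedly pick the
    sentence covering the most not-yet-covered facts.
    sentences: dict mapping sentence id -> set of facts it mentions."""
    uncovered = set().union(*sentences.values()) if sentences else set()
    chosen = []
    while uncovered:
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        if not sentences[best] & uncovered:
            break  # remaining facts are uncoverable
        chosen.append(best)
        uncovered -= sentences[best]
    return chosen
```

QA pairs would then be generated only from the chosen sentences, which is how the framework keeps the fine-tuning set small without losing factual coverage.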